Applying Rule-Based Normalization to Different Types of Historical Texts - An Evaluation

Authors

  • Marcel Bollmann
  • Florian Petran
  • Stefanie Dipper

Abstract

This paper deals with normalization of language data from Early New High German. We describe an unsupervised, rule-based approach that maps historical wordforms to modern wordforms. Rules are specified in the form of context-aware rewrite rules that apply to sequences of characters. They are derived from two aligned versions of the Luther Bible and weighted according to their frequency. Applying the normalization rules to texts by Luther results in 91% exact matches, clearly outperforming the baseline (65%). The share of exact matches rises to 93% when the approach is combined with a word substitution list. When applied to more diverse language data from roughly the same period, performance drops to 42% exact matches (baseline: 32%) but remains higher than with a wordlist. The results show that rules derived from a very different type of text can support normalization to a certain extent.
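
The idea of weighted, context-aware character rewrite rules can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Rule` fields, the `#` word-boundary marker, the greedy left-to-right application, and the example rules and weights are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    left: str    # required left context ("#" marks a word boundary)
    source: str  # historical character sequence to rewrite
    target: str  # modern replacement
    right: str   # required right context
    weight: int  # hypothetical frequency count from aligned training data


def normalize(word, rules):
    """Greedily rewrite a word left to right, preferring heavier rules."""
    padded = "#" + word + "#"
    end = len(padded) - 1  # index of the closing boundary marker
    out = []
    i = 1
    while i < end:
        best = None
        for rule in rules:
            j = i + len(rule.source)
            matches = (
                rule.source
                and j <= end
                and padded.startswith(rule.source, i)
                and padded[:i].endswith(rule.left)
                and padded.startswith(rule.right, j)
            )
            if matches and (best is None or rule.weight > best.weight):
                best = rule
        if best is None:
            out.append(padded[i])  # no rule applies: copy the character
            i += 1
        else:
            out.append(best.target)
            i += len(best.source)
    return "".join(out)


# Illustrative rules only; the weights are invented.
RULES = [
    Rule("#", "v", "u", "n", 10),  # word-initial "v" before "n" -> "u"
    Rule("", "th", "t", "", 7),    # "th" -> "t" in any context
    Rule("", "th", "th", "", 2),   # competing identity rule, lower weight
]

print(normalize("vnd", RULES))
print(normalize("thun", RULES))
print(normalize("von", RULES))
```

In this sketch the context requirement keeps "von" unchanged, because the "v" rule only fires before "n", and the weight decides between the two competing "th" rules.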

Similar resources

Normalizing Medieval German Texts: from rules to deep learning

The application of NLP tools to historical texts is complicated by a high level of spelling variation. Different methods of historical text normalization have been proposed. In this comparative evaluation I test the following three approaches to text canonicalization on historical German texts from the 15th–16th centuries: rule-based, statistical machine translation, and neural machine translation....

Rule-Based Normalization of Historical Texts

This paper deals with normalization of language data from Early New High German. We describe an unsupervised, rule-based approach which maps historical wordforms to modern wordforms. Rules are specified in the form of context-aware rewrite rules that apply to sequences of characters. They are derived from two aligned versions of the Luther Bible and weighted according to their frequency. The eva...

POS Tagging for Historical Texts with Sparse Training Data

This paper presents a method for part-of-speech tagging of historical data and evaluates it on texts from different corpora of historical German (15th–18th century). Spelling normalization is used to preprocess the texts before applying a POS tagger trained on modern German corpora. Using only 250 manually normalized tokens as training data, the tagging accuracy of a manuscript from the 15th cen...

A Study on the Commentary of Historical Verses with an Emphasis on the Rule of Al-Ibrah

One of the prevalent rules for the commentary of historical verses, which have a specific occasion of revelation and refer to a specific time and place, is the rule of al-Ibrah, stated as: take into consideration the universality of the word, not the particularity of the occasion. The source of this rule is the verses that have a universal wording and a particular occasion. The referent of the...

Normalization of qPCR array data: a novel method based on procrustes superimposition

MicroRNAs (miRNAs) are short, endogenous non-coding RNAs that function as guide molecules to regulate transcription of their target messenger RNAs. Several methods including low-density qPCR arrays are being increasingly used to profile the expression of these molecules in a variety of different biological conditions. Reliable analysis of expression profiles demands removal of technical variati...

Publication date: 2011